-
Notifications
You must be signed in to change notification settings - Fork 125
Whitelisting Onelake API & Workspace PL FQDNs #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @SmritiAgrawal04
This PR needs some tests to show it working I think
crepererum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do somewhat agree with @alamb. I know it's close to impossible to write an integration test for this, but maybe we can at least have a unit test for parse_url?
src/azure/builder.rs
Outdated
| let first_label = host.split('.').next().unwrap_or_default(); | ||
| self.account_name = Some(validate(first_label)?); | ||
|
|
||
| let container = parsed.path_segments().unwrap().next().expect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: onelake uses workspace terminology
src/azure/builder.rs
Outdated
| // Regex to match WS-PL FQDN: "{workspaceid}.z??.dfs.fabric.microsoft.com" | ||
| // workspaceid = 32 hex chars, z?? = z + first two chars of workspaceid | ||
| lazy_static::lazy_static! { | ||
| static ref WS_PL_REGEX: Regex = Regex::new(r"^(?P<workspaceid>[0-9a-f]{32})\.z(?P<xy>[0-9a-f]{2})\.(dfs|blob)\.fabric\.microsoft\.com$").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add support for .onelake.fabric.microsoft.com also
src/azure/builder.rs
Outdated
| let xy = captures.name("xy").unwrap().as_str(); | ||
|
|
||
| // Validate z?? matches first 2 chars of workspaceid | ||
| if &workspaceid[0..2] != xy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this validation
src/azure/builder.rs
Outdated
| } | ||
|
|
||
| // Otherwise, check Fabric global / Onelake API FQDN | ||
| if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are checking for global endpoint, should these not be with not like (!host.ends_with(DFS_FABRIC_SUFFIX) && !host.ends_with(BLOB_FABRIC_SUFFIX))
src/azure/builder.rs
Outdated
| if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) { | ||
| let labels: Vec<&str> = host.split('.').collect(); | ||
| let account_name = if labels.len() >= 2 && labels[0].contains("api") && labels[1] == "onelake" { | ||
| format!("{}-{}", labels[0], labels[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are referring workspace id as account_name, then this will not work for non pl scenario. In those case it will give account_name as "westus-api-onelake"
|
Hi @alamb & @crepererum, I added the unit tests as suggested. I request to review the PR please. |
| self.container_name = Some(validate(parsed.username())?); | ||
| self.account_name = Some(validate(a)?); | ||
| self.use_fabric_endpoint = true.into(); | ||
| } else if let Some(a) = host.strip_suffix("-api.onelake.fabric.microsoft.com") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft doc so we can cite it in code/tests?
I don't see it in https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still outstanding I think. Is this a vaild url?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it is a valid URL- added in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SmritiAgrawal04 -- i left some more questions
BTW I asked codex for a review of this PR like this
please review this PR (checked out in this directory): https://github.com/apache/arrow-rs-object-store/pull/552
Specifically ensure that the format of the URLs supported matches publically published microsoft urls and provide documentation references
And this was the output which I found helpful
───────────────────────────────────────────────────────────────────────────────────────────────
• Findings
- High: -api.onelake.fabric.microsoft.com is accepted for both abfs(s) and https, but I can’t find this FQDN in
the public OneLake URL docs. The published endpoints are onelake.dfs.fabric.microsoft.com and workspace-scoped
https://{workspaceId}.z{xy}.dfs.fabric.microsoft.com/... (see https://learn.microsoft.com/en-us/fabric/onelake/
onelake-access-api). If -api.onelake is internal or not publicly documented, this should be gated or removed. - Medium: The WS‑PL regex no longer verifies that z{xy} matches the first two hex chars of the workspace id (src/
azure/builder.rs:692-708). The published WS‑PL format is https://
{workspaceId}.z{first2}.dfs.fabric.microsoft.com (same doc). Without the check, non‑published FQDNs are
accepted. - Medium: WS‑PL regex allows .onelake.fabric.microsoft.com (src/azure/builder.rs:692-699). The public docs list
dfs (and blob endpoints for blob access), but I don’t see onelake as a WS‑PL subdomain. Please confirm with a
Microsoft reference or remove to match published formats. - Low: WS‑PL regex only allows lowercase hex ([0-9a-f]{32}) (src/azure/builder.rs:697). If Microsoft treats
workspace IDs as case‑insensitive GUIDs in URLs, this will reject valid uppercase forms; consider documenting
that only lowercase is supported or make the regex case‑insensitive.
Doc references
- OneLake access API URL formats: https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
- Private Link / workspace-specific OneLake endpoints (if this PR targets WS‑PL): please confirm the exact doc
section that defines the workspaceId.z{xy}.dfs.fabric.microsoft.com pattern and whether any onelake WS‑PL host
is documented (I could not find it in public docs).
Questions / assumptions
- Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft
doc so we can cite it in code/tests? - Should WS‑PL accept only dfs/blob subdomains, or is .onelake.fabric.microsoft.com explicitly published?
|
FYI #604 may be related. |
Hi @alamb, I have addressed all comments. About, finding #1 & #3, we plan to add it to the public documentation, the PR for which is already out. I request to approve these changes meanwhile. Thanks |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work on this @SmritiAgrawal04
I am sorry for the length of time but our review bandwdith is very limited here
Part of the reason it has taken so long to review is there is no ticket or description of what issue this is fixing, nor good examples of what new features you have added
In order to proceed, please:
- Fill out the PR description with a clear description of what problem you are solving. This shoud, at a minimum,, provide URLs that can't be parsed with the current crate, and examples of the expected output
- Provide a link to the documentation of the format of onelake URLs so that we can verify the URLs should be parsed
- Add a link to the docs / references in the code so that future readers can refer back to them
| self.container_name = Some(validate(parsed.username())?); | ||
| self.account_name = Some(validate(a)?); | ||
| self.use_fabric_endpoint = true.into(); | ||
| } else if let Some(a) = host.strip_suffix("-api.onelake.fabric.microsoft.com") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still outstanding I think. Is this a vaild url?
| builder | ||
| .parse_url("https://Ab000000000000000000000000000000.zAb.dfs.fabric.microsoft.com/") | ||
| .unwrap(); | ||
| assert_eq!(builder.account_name, Some("ab000000000000000000000000000000.zab".to_string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the account name really supposed to have the ab000000000000000000000000000000 in it? It seems confusing that the container name is also ab000000000000000000000000000000
|
|
||
| let ws_pl_regex = WS_PL_REGEX.get_or_init(|| { | ||
| Regex::new( | ||
| r"(?i)^(?P<workspaceid>[0-9a-f]{32})\.z(?P<xy>[0-9a-f]{2})\.(onelake|dfs|blob)\.fabric\.microsoft\.com$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for using regexp math here rather than just hostname splitting? The examples / tests you have added don't seem to test for invalid URLs / that don't follow a simple .... format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Workspace private link FQDNs follow a pattern unlike other FQDNs that had been whitelisted earlier. WS‑PL hostnames have a very specific, structured format with invariants that we want to validate, not merely parse. If you verify the FQDN syntax here , you'd note that a simple split('.') can tell “there are N labels” and extract strings, but it doesn’t naturally enforce:
- wsid is exactly 32 hex chars,
- xy is exactly 2 hex chars,
- and xy == wsid[0..2].
- Added the test cases for invalid ws-pl fqdn.
|
Hey folks, thanks for taking the time to review this PR. |
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?
Yes. The documentaion is as below: